----------
Replication materials for Deadly Clerics
Rich Nielsen
Last edited: 5/10/2018
----------

This document describes the analysis in Deadly Clerics.  Most of the documentation is intended to help others reproduce the analysis, but some is for my own use, to help me remember how things unfolded.  I have tried to make the documentation accurate, but the analysis developed over many years, on multiple machines, and with varying levels of organization and documentation.

There is also a smaller version of the archive available for download, which gives access to almost all of the evidence mentioned in the book without having to engage with the data collection.  It is approximately 220 MB instead of 3.56 GB.

Conventions: I note both the original location of many scripts (for my own use) and the file path in the replication archive.  These are denoted [original] and [archived].  Only archived paths are available in this archive.
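
Because the [archived] paths are written as Windows paths rooted at E:\analysis_archive, they will not exist as written on most machines. The following is a minimal sketch of one way to translate them, assuming the archive has been unpacked to some local folder. The helper name resolve_archived and the argument local_root are hypothetical and are not part of the replication code.

```python
from pathlib import Path, PureWindowsPath

# Root used for [archived] paths throughout this document.
ARCHIVE_ROOT = PureWindowsPath(r"E:\analysis_archive")


def resolve_archived(archived_path, local_root):
    """Map an [archived] Windows path onto a local copy of the archive.

    `local_root` is wherever you unpacked the archive (an assumption;
    this helper is hypothetical and not part of the replication code).
    Raises ValueError if the path is not under E:\\analysis_archive.
    """
    relative = PureWindowsPath(archived_path).relative_to(ARCHIVE_ROOT)
    return Path(local_root).joinpath(*relative.parts)
```

For example, resolve_archived(r"E:\analysis_archive\web_sample\prep data_17jan2017.R", "/data/archive") would point at the same script inside a copy unpacked at /data/archive. [original] paths cannot be translated this way because those files are not in the archive.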

Software: As of 5/10/2018, the large archive includes a copy of the R 3.3.1 software along with all of the package versions I had installed at the time I finalized all of the replication code.  R is constantly updated, along with its add-on packages, so the code may not work if run on other versions of R and the auxiliary packages.  If you have a current version of R installed, copy these files into the directory containing your current version of R (on a Windows machine, this will usually be C:\Program Files\R).  Then run "Rgui.exe" (at R-3.3.1\bin\x64\Rgui.exe for 64-bit systems, R-3.3.1\bin\i386\Rgui.exe for 32-bit systems).

Table of Contents:
----------
Data Collection
1) Constructing a web census of clerics  
2) Sampling from the web census of clerics
3) Collecting writings for the web sample of clerics
4) Collecting biographical data for the web sample of clerics
5) Salafi oversample
6) Fieldwork (maps)  

Data Analysis
7) Characteristics of cleric biographies
8) Topic models
9) Detecting jihadist clerics
10) Regressions predicting jihadism
11) Analysis of fatwa page views

----------

Get started:
First, download and unpack the archive. The archive comes as a .tar file: first extract the .zip file from it, then unzip that file using standard software (such as 7zip). Note that due to the large size of the archive, this process may take up to several hours.
The unzipped archive contains approximately 98,300 files in 739 subdirectories and is about 3.56 GB in size.
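
The two unpacking steps can also be scripted. This is a minimal sketch using only the Python standard library; the function name unpack_archive and the file names passed to it are hypothetical, and 7zip or any other standard tool works just as well.

```python
import tarfile
import zipfile
from pathlib import Path


def unpack_archive(tar_path, dest_dir):
    """Extract the .tar file, then unzip every .zip file found inside it."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # Step 1: extract the outer .tar file.
    with tarfile.open(tar_path) as tar:
        tar.extractall(dest)

    # Step 2: unzip each .zip file that the .tar contained,
    # placing its contents next to the .zip file itself.
    unzipped = []
    for zip_path in sorted(dest.rglob("*.zip")):
        with zipfile.ZipFile(zip_path) as zf:
            zf.extractall(zip_path.parent)
        unzipped.append(zip_path)
    return unzipped
```

The function returns the list of .zip files it unpacked, so an empty list signals that the .tar file did not contain what was expected.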
 

If you want to understand the data collection, start from section 1.
If you want to do the data analysis and replicate results from the book, skip down to section 7.


--------------------------------------
Data Collection
--------------------------------------

----------
1 Constructing a web census of clerics
----------

The analysis relies on sampling from a census of clerics on the Arabic-speaking internet.  To create this census, I created lists of entities that were likely to be clerics using several sources: Google autosuggest, Google searches, aggregator sites, and Wikipedia.  I then identified and removed duplicates. This section documents the process.

----------
1.1 Google autosuggest
----------

I created my first list based on Google autosuggest. I developed some early scripts for querying the Google autosuggest API, saved at
    
    [original] C:\Users\Richard Nielsen\Desktop\Papers\ILM

I tried a few scripts before eventually getting to one that worked.  I queried the API using this python script:
    
    [original] C:\Users\Richard Nielsen\Desktop\Papers\ILM\google autosuggest api_18jun2014.py
    [archived] E:\analysis_archive\web_census\google_autosuggest\google autosuggest api_18jun2014.py
    [archived] E:\analysis_archive\web_census\google_autosuggest\google autosuggest api_18jun2014_python36.py

The script produces a text document with 12,643 candidate names saved here:
    [original] C:\Users\Richard Nielsen\Desktop\Papers\ILM\autosuggest results_18jun2014.txt
    [archived] E:\analysis_archive\web_census\google_autosuggest\autosuggest results_18jun2014.txt

This file is pulled in by the script that disambiguates duplicate clerics (see section 1.5 below).  Note that re-running "google autosuggest api_18jun2014.py" will not reproduce this file exactly, because Google constantly updates its autosuggest results.

I have left these scripts and files exactly as they were when I used them, which means that commenting is poor.

----------
1.2 Cleric websites from Google searches
----------
I also collected a list of cleric names by manually conducting a series of Google searches for terms that frequently occur in cleric websites. I then had RAs look through each result to make sure that the sites were in fact cleric websites. The original Google search results are spread out over 8 documents in PDF and Word format in this folder:

    [original] C:\Users\Richard Nielsen\Desktop\Papers\Fatawa\wikipedia
    [archived] E:\analysis_archive\web_census\google_search\searches

The coding rules I sent to my RA for identifying cleric websites are at:

    [archived] E:\analysis_archive\web_census\google_search\coding rules for identifying cleric websites.txt

This process led to the elimination of many candidates and sites. The resulting list of about 800 cleric sites, which I read in to the disambiguation script (see section 1.5 below), is:

    [original] C:/Users/Richard Nielsen/Dropbox (MIT)/RA/Emma_UROP/cleric database/cleric websites_4jun2014_test.xls
    [archived] E:\analysis_archive\web_census\google_search\cleric websites_4jun2014_test.xls

----------
1.3 Cleric lists on aggregator sites
----------

I identified lists of clerics on Islamic aggregator websites and collected the names.  The related files are in this directory:
   
   [original] C:\Users\Richard Nielsen\Dropbox\RA\Emma_UROP\cleric database\cleric lists
   [archived] E:\analysis_archive\web_census\aggregator_site_lists\cleric lists

Most of the subdirectories in this directory contain a short script that I used to scrape the relevant list and then a .csv file that contains the list of clerics.  A few of the subdirectories contain lists that I didn't end up using so they are incomplete. In the disambiguation script, I reference 24 of the 31 folders.

An example of one script is:

   [archived, but not cleaned] E:\analysis_archive\web_census\aggregator_site_lists\cleric lists\alukah list 2\alukah2_27jun2014.R
   [archived and cleaned] E:\analysis_archive\web_census\aggregator_site_lists\cleric lists\alukah list 2\alukah2_27jun2014_cleaned.R

This script creates a file with the names of over 18,000 clerics:

   [archived] E:\analysis_archive\web_census\aggregator_site_lists\cleric lists\alukah list 2\alukah2_27jun2014.csv
 	 
I have left the other scripts and files exactly as they were when I used them, which means that commenting is poor.

----------
1.4 Wikipedia Spider
----------

I also created a list of clerics on Wikipedia by creating a "spider" that crawls article links on Wikipedia and classifies articles as being about a cleric or not. I tried a few iterations of the spider over a number of years, but the final version is this one:

    [original] C:\Users\Richard Nielsen\Dropbox\RA\Emma_UROP\cleric database\wikipedia
    [archived] E:\analysis_archive\web_census\wikipedia_spider\wikipedia

This directory has a lot of scripts from various attempts and plans.  The run that I actually used is saved here:

    [archived] E:\analysis_archive\web_census\wikipedia_spider\wikipedia\spider\run_19jun2014

The script that executed this run is here:

    [archived] E:\analysis_archive\web_census\wikipedia_spider\wikipedia\spider\wikipedia spider_19jun2014.R 
    [archived and cleaned] E:\analysis_archive\web_census\wikipedia_spider\wikipedia\spider\wikipedia spider_19jun2014_cleaned.R 

Note that this script uses the saved object:
    
    [archived] E:\analysis_archive\web_census\wikipedia_spider\wikipedia\Original rf plan\rfModel workspace_18jun2014.RData

This object was created with this script:

    [archived] E:\analysis_archive\web_census\wikipedia_spider\wikipedia\Original rf plan\random forest model fitting_18jun2014 - rich working.R

There are readme files that have some useful notes about which scripts and directories are actually important.

    [archived] E:\analysis_archive\web_census\wikipedia_spider\wikipedia\readme.txt
    [archived] E:\analysis_archive\web_census\wikipedia_spider\wikipedia\spider\readme.txt

The Wikipedia spider was trained to classify Wikipedia articles using an initial hand coding of a large number of Wikipedia articles from an earlier version of the spider.  The coding was done by my RA, Marsin Alshamary. The hand-coding results are saved in:

    [archived] E:\analysis_archive\web_census\wikipedia_spider\wikipedia\coding4.csv

Marsin's original coding was saved in this folder:

    [original] C:\Users\Richard Nielsen\Desktop\Papers\Fatawa\wikipedia\marsin coding_19aug2013
    [archived] E:\analysis_archive\web_census\wikipedia_spider\marsin coding_19aug2013

The articles that Marsin coded were collected during a much earlier wikipedia scrape:

    [original] C:\Users\Richard Nielsen\Desktop\Papers\Fatawa\wikipedia\old scrapes_pre 19oct2013\wikipedia scrape_thirdCompleteRun_29jan5.01pm
    [archived] E:\analysis_archive\web_census\wikipedia_spider\wikipedia scrape_thirdCompleteRun_29jan5.01pm

I processed these files for Marsin to code using:

    [archive] C:\Users\Richard Nielsen\Desktop\Papers\Fatawa\analysis_book\web_census\wikipedia_spider\marsin coding_19aug2013\prep for marsin_19aug2013.py

I have left these scripts and files exactly as they were when I used them, which means that commenting is poor.

----------
1.5 Disambiguating cleric names
----------

With the lists of possible cleric names from various sources in hand, I then tried to disambiguate them before sampling from the list. The disambiguation originally occurred in this folder: 

    [original] C:\Users\Richard Nielsen\Dropbox (MIT)\RA\Emma_UROP\cleric database\New Name Disambiguator

which is a subdirectory of a dir with several prior disambiguation efforts:
    
    [original] C:\Users\Richard Nielsen\Dropbox (MIT)\RA\Emma_UROP\cleric database

The disambiguation materials are now saved here:
    [archived] E:\analysis_archive\web_census\disambiguation\New Name Disambiguator

The main script that does the disambiguation is this one:

    [archived] E:\analysis_archive\web_census\disambiguation\New Name Disambiguator\new name disambiguator setup_11oct2014.R

...but I've updated the directories in this version:

    [archived] E:\analysis_archive\web_census\disambiguation\New Name Disambiguator\new name disambiguator setup_11oct2014_modified_24apr2016.R

Note that this script references several GUI R scripts which are interactive.  I used these to do the time-consuming process of disambiguating likely duplicates by hand, using common sense and Google searches. The final results of the disambiguation process, which contain the names of 10,161 clerics, are saved in this R workspace:

    [archived] E:\analysis_archive\web_census\disambiguation\disambiguated names workspace.RData



----------
2 Sampling from the web census of clerics
----------

The disambiguation process produced a list of clerics that I thought was pretty good so I started working with it in this script:

    [original] C:\Users\Richard Nielsen\Desktop\Papers\Fatawa\website collection\websites from names_summer2015\search for cleric websites_27may2015.R
    [archived] E:\analysis_archive\web_sample\search for cleric websites_27may2015.R

...and modified so that the paths work at:
    
    [archived] E:\analysis_archive\web_sample\search for cleric websites_27may2015_modified_24apr2016.R

In this script, I created a file web_sample\test.txt that served as a template for a file called

    [archived] E:\analysis_archive\web_sample\data_sets\websites identified.txt

that I iteratively modified by hand, adding websites for each cleric.  The workflow was not great and is hard to reproduce (see below), but I couldn't find a way to do it while keeping a complete record of all changes, so "websites identified.txt" is the original and there are two identical backups:

    [original] C:\Users\Richard Nielsen\Desktop\Papers\Fatawa\website collection\websites from names_summer2015\websites identified - Copy.txt
    [original] C:\Users\Richard Nielsen\Desktop\Papers\Fatawa\website collection\websites from names_summer2015\websites identified - Copy (2).txt

Elaboration on: E:\analysis_archive\web_sample\search for cleric websites_27may2015_modified_24apr2016.R:

I carried out a very convoluted process of googling clerics by hand, then trying to get weblinks for them from the original sources where I got their names (though these can be hard to match up). What happened is this: I googled about a third of the 10,000 clerics. In the process, I was also cleaning up some of the entries and making other minor changes. Then I realized that I could save myself a lot of time by matching up the links I already had for them.  I stopped googling and created a way of matching up the links with names in a new spreadsheet, from which I could copy the relevant column into the main working spreadsheet (websites identified.txt) by hand (poor form, but it was easy at the time). Once I had the links in there for many of them, I went back and googled the rest, but only if they didn't have a good link.  The idea was to save myself time, because I really wasn't sure what the best workflow would be for disambiguating and making sure that all clerics had an "online presence". The script web_sample\search for cleric websites_27may2015_modified_24apr2016.R has some of the paths modified but not all.  I'm not even sure it's worth working through, because it is a record of what happened but isn't enough to reproduce the googling process. Really, a complex set of things happened and I produced a final "websites identified.txt" file.

After the googling process, I then moved to this script:

    [original] C:\Users\Richard Nielsen\Desktop\Papers\Fatawa\website collection\websites from names_summer2015\clean website list and sample_9jun2015.R
    [archived] E:\analysis_archive\web_sample\clean website list and sample_9jun2015.R

...and with a modified version:

    [archived] E:\analysis_archive\web_sample\clean website list and sample_9jun2015_modified_26apr2016.R

This script is where I first did some disambiguation based on web links, then sampled the individuals (by randomly ordering them). Then, I wrote out a file in which I manually coded whether they were eligible based on six criteria described in the book (such as writing in Arabic and having accessible text output). Then, I read the manually coded file back in to see what I had. Finally, this script:

    [original] C:\Users\Richard Nielsen\Desktop\Papers\Fatawa\website collection\websites from names_summer2015\generate data collection templates_18jun2015.R
    [archived] E:\analysis_archive\web_sample\generate data collection templates_18jun2015.R

made the template file that I used for hand-coding the data, called 

    [archived] E:\analysis_archive\web_sample\fastcoding_18jun2015.txt

which I then converted to UTF-8 format by hand after the coding was complete:

    [archived] E:\analysis_archive\web_sample\fastcoding_18jun2015_utf8.txt
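
The conversion above was done by hand, but the same step can be scripted. This is a minimal sketch, not the procedure that produced the archived file: the function name convert_to_utf8 is hypothetical, and the source encoding must be supplied by the caller because this document does not record what encoding the original file used.

```python
from pathlib import Path


def convert_to_utf8(src, dst, source_encoding):
    """Re-save a text file as UTF-8.

    `source_encoding` must be supplied by the caller; this document does
    not record the original encoding of the file.
    """
    # Decode with the caller-supplied encoding, then re-encode as UTF-8.
    text = Path(src).read_text(encoding=source_encoding)
    Path(dst).write_text(text, encoding="utf-8")
    return dst
```

Writing to a separate destination file, as the archive does with the "_utf8" suffix, keeps the original coding file untouched in case the assumed source encoding turns out to be wrong.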

On May 5, 2016, I decided to increase the sample size from 182 (first book submission) to an even 200, so I did more eligibility coding of clerics on my list. This happens in the script: 

    [original] C:\Users\Richard Nielsen\Desktop\Papers\Fatawa\analysis_book\web_sample\assist with 5may2016 eligibility coding_5may2016.R
    [archived] E:\analysis_archive\web_sample\aux_scripts\assist with 5may2016 eligibility coding_5may2016.R

and in manual coding of:

    [original] C:\Users\Richard Nielsen\Desktop\Papers\Fatawa\analysis_book\web_sample\sampled clerics_5may2016_eligibility coding.txt

I first created "sampled clerics_5may2016_eligibility coding.txt" as an exact copy of "sampled clerics_9jun2015_eligibility coding.txt".  I then used the code in the script above to figure out which of the clerics that I had judged eligible in the quick eligibility check ended up not being eligible when I did the fast coding.  Most of the links I used for eligibility coding are in the spreadsheet itself ("sampled clerics_5may2016_eligibility coding.txt").

Some of the clerics I added on May 5th didn't have text, however, so I had to add even more, which resulted in the file:

    [original] C:\Users\Richard Nielsen\Desktop\Papers\Fatawa\analysis_book\web_sample\sampled clerics_31may2016_eligibility coding.txt" which has

This still wasn't quite enough. On 7/4/2016 I figured out I needed to code one more cleric to get to 200.  I modified 

    [original] C:\Users\Richard Nielsen\Desktop\Papers\Fatawa\analysis_book\web_sample\sampled clerics_31may2016_eligibility coding.txt 

by hand to get one more and rather than updating the script: 
 
    [original] C:\Users\Richard Nielsen\Desktop\Papers\Fatawa\analysis_book\web_sample\generate data collection templates_18jun2015.R

I updated the following files by hand:
    
    [original] C:/Users/Richard Nielsen/Desktop/Papers/Fatawa/analysis_book/web_sample/text/cleric website checking_10may2016_noarabic_3columns.csv
    [original] C:/Users/Richard Nielsen/Desktop/Papers/Fatawa/analysis_book/web_sample/fastcoding_7may2016.txt
    [original] C:\Users\Richard Nielsen\Desktop\Papers\Fatawa\analysis_book\web_sample\Clerics.coded.Revised_8jun2016.xlsm

    [archived] E:\analysis_archive\web_sample\text\cleric website checking_15aug2016_noarabic_3columns.csv
    [archived] E:\analysis_archive\web_sample\data_sets\fastcoding_7may2016.txt
    [archived] E:\analysis_archive\web_sample\data_sets\Clerics.coded.Revised_8jun2016.xlsm

These are the data sets that the subsequent scripts in the data merging process expect.
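
The sampling logic described in this section (randomly order the census, then code eligibility down the list until the target sample size is reached) can be illustrated as follows. This is a minimal sketch under stated assumptions, not the code in the R scripts above: the function name order_and_select is hypothetical, is_eligible stands in for the manual eligibility coding, and the seed is arbitrary rather than the one used for the book.

```python
import random


def order_and_select(names, is_eligible, target, seed):
    """Randomly order `names`, then keep the first `target` eligible ones.

    `is_eligible` stands in for the manual eligibility coding; `seed` is an
    arbitrary value for reproducibility, not the one used in the book.
    """
    ordered = list(names)
    random.Random(seed).shuffle(ordered)

    selected = []
    for name in ordered:
        if len(selected) == target:
            break
        if is_eligible(name):
            selected.append(name)
    return ordered, selected
```

With a fixed random ordering, raising the target (for example from 182 to 200) only requires coding further down the same list; the clerics already selected stay in the sample.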

----------
3 Collecting writings for the web sample of clerics
----------

Collecting the writings of the clerics in my sample was a complicated process and difficult to fully document.  Vincent Bauer worked as an RA for me over the summer of 2015 and did a lot of the organization and scraping.

The spreadsheet:
 
    [original] C:\Users\Richard Nielsen\Desktop\Papers\Fatawa\analysis_book\web_sample\text\cleric website checking_15aug2016_noarabic_3columns.csv
    [archived] E:\analysis_archive\web_sample\text\cleric website checking_15aug2016_noarabic_3columns.csv

records where the text for each cleric is saved.  It looks like I (manually?) noted the source from which to draw texts in this spreadsheet:
   
    [original] C:/Users/Richard Nielsen/Desktop/Papers/Fatawa/website collection/websites from names_summer2015/cleric website checking_1jan2016_noarabic.txt

using the column "best.source.for.text" in this spreadsheet.  I did it by hand as I was going through the complete list of clerics, with multiple sources separated by semicolons.  I updated it in "cleric website checking_10may2016_noarabic_3columns.csv", which includes some hand edits to deal with the fact that I had one cleric in the sample twice (what are the odds!), as "a7md 3ysA" and "a7md 3ysA alm3Srawy", with different text sources for each entry.  On July 3, 2016, I copied and pasted so that both entries have the same sources.  On 15 aug 2016, I realized that I had only two documents for "3bd alr7mn bn Sal7 alaTrm" from Islamway and they were both very strange, causing him to be classified as a jihadist. I went to his website and got more representative writings, resulting in:

    [original] C:\Users\Richard Nielsen\Desktop\Papers\Fatawa\analysis_book\web_sample\text\cleric website checking_15aug2016_noarabic_3columns.csv
    [archived] E:\analysis_archive\web_sample\text\cleric website checking_15aug2016_noarabic_3columns.csv
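
Reading a semicolon-separated source column of this kind might look like the sketch below. It is an illustration only: "best.source.for.text" is the one column name given in this document, while the function name sources_by_cleric, the name_column argument, and the assumption that the data are comma-separated with a header row are all hypothetical.

```python
import csv
import io


def sources_by_cleric(csv_text, name_column, source_column="best.source.for.text"):
    """Return {cleric name: [source, ...]} from a checking spreadsheet.

    `name_column` is a hypothetical column name; only
    "best.source.for.text" is named in this document.
    """
    result = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        raw = row.get(source_column) or ""
        # Multiple sources are separated by semicolons.
        result[row[name_column]] = [s.strip() for s in raw.split(";") if s.strip()]
    return result
```

Comparing the resulting lists for two entries is one way to spot a case like the duplicated cleric described above, where the same person appeared under two names with different sources.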

A lot of the model scripts I used to collect the texts from various formats are at 
    
    [original] E:\websites\scripts
    [archived] E:\analysis_archive\web_sample\text\scraping_scripts

These scripts aren't intended to run as part of the archive, but are here for documentation purposes.  I didn't save copies of the scripts for every scrape I did. 

An example of one script is:

   [archived] E:\analysis_archive\web_sample\text\scraping_scripts\alukah scrape example.R

This script has example code for scraping text from the alukah web site, the source for over 30 out of the 200 clerics.

Another script converts word documents to text format:

   [archived] E:\analysis_archive\web_sample\text\scraping_scripts\convert word files to text.R

For clerics with writings on tawhed.ws or shamela that I've already scraped and converted, I just got the text from places where I had it stored: 

   [original] E:\websites\www.tawhed.ws 
   [original] E:\shamela


----------
4 Collecting biographical data for the web sample of clerics
----------

I identified biographies for the clerics in the web sample using google searches. Saving the biographies in their original web format can sometimes be difficult, so they are saved in flat text files, text only, with a web link for each biography above the entry and commented.
I started with the links in "sampled clerics_5may2016_eligibility coding.txt" but googled further to try to (mostly) exhaustively get all the bios that are online for each cleric.

Coding biographical information from the biographies: The coding was a bit of a complicated process.
I did a "fast coding" that resulted in "fastcoding_18jun2015.txt".  This was "fast" in the sense that I wasn't very careful.  I often rounded the number of teachers rather than counting it exactly. I then had Chase Rennick do a detailed coding.  I think he did a good job, though we had some discrepancies where I had more teachers than he did.  The final corrected copy is

    Clerics.coded.Revised_with teacher discrepancies_30nov2015_chase corrected.xlsm

I then updated this to include the new clerics I added to get to a full 200 for the start of the analysis.  The final biography coding is here:

    [original] C:\Users\Richard Nielsen\Desktop\Papers\Fatawa\analysis_book\web_sample\Clerics.coded.Revised_8jun2016.xlsm
    [archived] E:\analysis_archive\web_sample\data_sets\

On 15 aug 2016, I decided that there were too many discrepancies between my coding and Chase's coding of the number of insider appointments, so I decided to take a different approach and recode it all with just my qualitative assessment of whether they are an insider or not.

I made a coding template with:

    [original] C:\Users\Richard Nielsen\Desktop\Papers\Fatawa\analysis_book\web_sample\make insider career coding template_15aug2016.R
    [archived] E:\analysis_archive\web_sample\aux_scripts\make insider career coding template_15aug2016.R

which makes this raw file:

    [original] C:\Users\Richard Nielsen\Desktop\Papers\Fatawa\analysis_book\web_sample\insider discrepancies_15aug2016.csv
    [archived] E:\analysis_archive\web_sample\data_sets\insider discrepancies_15aug2016.csv

I did the coding by hand in this file:

    [original] C:\Users\Richard Nielsen\Desktop\Papers\Fatawa\analysis_book\web_sample\insider discrepancies_15aug2016.xlsx
    [archived] E:\analysis_archive\web_sample\data_sets\insider discrepancies_15aug2016.xlsx

Note to myself: before I was working in the "analysis_book" subdir, I was working for a long time in:

    [original] C:\Users\Richard Nielsen\Desktop\Papers\Fatawa\website collection\websites from names_summer2015

If I can't find something, check there.


----------
5 Collecting the Salafi oversample
----------

The Salafi oversample comes from an ad hoc sampling procedure I used in my dissertation research, before I constructed a systematic sample using a census of clerics on the web.  It is the oldest part of the project and the hardest to document because the data came together over approximately seven years and my personal documentation is not great.  

The directories saved at 

    [archived] E:\analysis_archive\salafi_oversample\bios_with_sources_and_coding

are named with the transliterated Arabic names of the clerics in the oversample, and each subdirectory contains (typically) the following contents:
a) a directory containing the text of each biography I collected for that individual, copied verbatim from the sources listed in "sources.txt".  I have the original html in my own files but I did not include it in the archive.
b) a document called "sources.txt" that lists the sources I used for each cleric
c) a document called "bio coding.txt" in which I hand coded variables about each cleric from their biographies
d) a script called "bio coding to R.R" that is a modified version of "bio coding.txt" meant to be sourced as part of an R script that merges the coded variables.
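
A quick way to see which cleric directories depart from this typical layout is sketched below. It is a hypothetical helper, not part of the replication code: the function name check_oversample_dirs is invented, and because the name of the biography directory in item (a) is not given here, the check only asks whether any subdirectory exists.

```python
from pathlib import Path

# File names listed in items (b) through (d) above.
EXPECTED_FILES = ("sources.txt", "bio coding.txt", "bio coding to R.R")


def check_oversample_dirs(root):
    """Report which expected files are missing from each cleric directory."""
    missing = {}
    for cleric_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        absent = [f for f in EXPECTED_FILES if not (cleric_dir / f).is_file()]
        # Item (a): at least one subdirectory holding the biography texts.
        if not any(child.is_dir() for child in cleric_dir.iterdir()):
            absent.append("<biography directory>")
        if absent:
            missing[cleric_dir.name] = absent
    return missing
```

Since the contents are only "typically" as listed, a non-empty result is not necessarily an error; it just flags directories worth inspecting by hand.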

For my own notes, these files were originally in the directory

    [original] C:\Users\Richard Nielsen\Desktop\Papers\Fatawa\cleric database\c\

where the subdirectories also contained a subdirectory called "fatwas" that had the text written by each cleric, along with the scripts that collected the text.

The texts are saved in the archive at 

    [archived] E:\analysis_archive\detecting_jihadist_clerics\salafi_oversample_text\c

but not the scripts that collected them.  Each is in a subdirectory called "prestem" which is an artifact of how I used to do the stemming. These are the texts before stemming.


----------
6 Fieldwork Maps
----------

Reproduction of map of al-Azhar library from my field notes:

    [original] C:\Users\Richard Nielsen\Desktop\Papers\Fatawa\cairo june 2011\field notes cairo june 2011.pdf
    [archived] E:\analysis_archive\fieldnote_selections\map of al-Azhar library shelves 2011.pdf

Reproduction of map of al-Azhar prayer space from my field notes:

    [original] C:\Users\Richard Nielsen\Desktop\Papers\Fatawa\cairo june 2011\field notes cairo june 2011.pdf
    [archived] E:\analysis_archive\fieldnote_selections\al-Azhar prayer space size calculation.pdf



--------------------------------------
Data Analysis
--------------------------------------

----------
7 Characteristics of Muslim Clerics
----------

These scripts provide and analyze the biographical materials for Chapter 4 of my book.

Most of the analysis is done in the script below. The script characterizes the clerics using topic models, k-means clustering, and word frequencies in their bios. 
This includes a cluster analysis of web sample cleric biographies (Figure 4.1), summary statistics including academic terms that clerics use (Table 4.1), 
and a map of cleric locations (Figure 4.2).

    [original] C:\Users\Richard Nielsen\Desktop\Papers\Fatawa\analysis_book\web_sample\cluster bios_25jul2016.R
    [archived] E:\analysis_archive\web_sample\cluster bios_25jul2016.R

Plots of cleric degrees by date of birth (Figure 4.3 in book) are made in this script:

    [original] C:\Users\Richard Nielsen\Desktop\Papers\Fatawa\analysis_book\web_sample\cleric degrees by birthdate_27jul2016.R
    [archived] E:\analysis_archive\web_sample\cleric degrees by birthdate_27jul2016.R


----------
8 Topic Models 
----------

These scripts produce the topic analyses of the fatwa corpus and of the jihadist writings.

----------
8.1 Fatwa topic analysis from islamweb.net 
----------

Data prep happened in a series of scripts originally at C:\Users\Richard Nielsen\Desktop\Papers\Fatawa\www.islamweb.net\scripts. These scripts are not included in the archive due to possible copyright issues. 

The script below creates Figure 2.1: The Topics of a Large Fatwa Corpus. Note that topics are determined by the website administrators.
 
    [original] C:\Users\Richard Nielsen\Desktop\Papers\Fatawa\www.islamweb.net\scripts\play with the data.R
    [archived] E:\analysis_archive\islamweb_fatwas\islamweb analysis.R


----------
8.2 Topic model of jihadist bookbag 
----------
This code creates Figure 5.1 and the materials used in Appendix B3 (Assigning Topic Labels to a Topic Model of the Jihadist's Bookbag):

    [original] C:\Users\Richard Nielsen\Desktop\Papers\Fatawa\analysis_diss\lda bookbag.R
    [archived] E:\analysis_archive\detecting_jihadist_clerics\lda bookbag.R

This script at some point refers to
    
    [original] C:\Users\Richard Nielsen\Desktop\Papers\Fatawa\cleric database\h\scripts\match docs to authors.py
    [archived] E:\analysis_archive\detecting_jihadist_clerics\aux_scripts\match docs to authors.py

which matches authors to their texts in the jihadist's bookbag.  This script in turn calls an old python version of my Arabic stemmer

    [original] C:\Users\Richard Nielsen\Desktop\Papers\Fatawa\cleric database\h\scripts\light10.py
    [archived] E:\analysis_archive\detecting_jihadist_clerics\aux_scripts\light10.py

The jihadist's bookbag texts are saved at 
    [original] C:\Users\Richard Nielsen\Desktop\Papers\Fatawa\cleric database\h
    [backup] C:\Users\Richard Nielsen\Desktop\Papers\Fatawa\analysis_book\h

The bookbag comes as a zipped directory of files that I am not including in the archive because
of potential legal complications.  The file:

    [original] C:\Users\Richard Nielsen\Desktop\Papers\Fatawa\analysis_book\h\readme.txt

describes how I opened the zipped dir: "I unzipped it with 7zip on my office computer.  Two of the files were deleted as "e-book" spyware.  The remaining unzipped files are in "bag_mgahed original"."

Then the readme describes: "The first thing is to get all of the documents I can into text.  It's easier to work with the word files, so I use the script 
"~/cleric database/h/scripts/word_to_txt.py" to change all the word files in the bookbag and save them in ~/h/t."
This script is now saved at
    
    [original] C:\Users\Richard Nielsen\Desktop\Papers\Fatawa\analysis_book\h\scripts\word_to_txt.py
    [archived] E:\analysis_archive\detecting_jihadist_clerics\training_corpus_text\h\word_to_txt.py

though it won't work in the archive because the original files aren't there.

The old readme then notes that "Then, the processing and stemming happens in the "parse all database fatwas.py""

For the supervised learning in the next section, I redid the stemming using the latest stemmer, but for the analysis in this section, I used the stemming from this script, which processed all of the text for the dissertation version of this research.  This script is at:

    [original] C:\Users\Richard Nielsen\Desktop\Papers\Fatawa\cleric database\scripts\parse all database fatwas.py
    [archived] E:\analysis_archive\detecting_jihadist_clerics\training_corpus_text\h\parse all database fatwas.py

It won't work in the archive because the source files aren't included, but I want it there for completeness.  It produced the files in the directory:

    [original] C:\Users\Richard Nielsen\Desktop\Papers\Fatawa\cleric database\h\stemmed

which are now copied into the archive:

    [archived] E:\analysis_archive\detecting_jihadist_clerics\training_corpus_text\h\stemmed



----------
9 Detecting jihadist clerics
----------

This script does the text classification that produces the cleric jihad scores.  It also produces the word cloud illustrating the words that matter in the classifier model, and calculates various measures of cleric vocabulary size for later analysis.

    [archived] E:\analysis_archive\detecting_jihadist_clerics\onedoc jihad scores_7jan2017.R
    (Note: the archived file is the original, because I corrected minor errors while archiving the data analysis)

This script does the same things as the previous script but for the salafi oversample.

    [archived] E:\analysis_archive\detecting_jihadist_clerics\onedoc scores for jihadists in oversample_7jan2017.R
    (Note: the archived file is the original, because I corrected minor errors while archiving the data analysis)

Alternative jihad scores using tawhed.ws as the training set are made in this script:

    [archived] E:\analysis_archive\detecting_jihadist_clerics\onedoc jihad scores with tawhed training docs_7jan2017.R
    (Note: the archived file is the original, because I corrected minor errors while archiving the data analysis)

This script validates the jihad scores against various expert codings. It creates Figures 5.3 and 5.4 from the book.

    [archived] E:\analysis_archive\detecting_jihadist_clerics\validation_7jan2017.R
    (Note: the archived file is the original, because I corrected minor errors while archiving the data analysis)

This script creates a figure ranking jihadists by their jihad scores.
    [archived] E:\analysis_archive\detecting_jihadist_clerics\rank jihadists_10feb2017.R
    (Note: this file is the original)

This script combines the data from various sources and formats it for validation and for the main regression analysis.
    
    [archived] E:\analysis_archive\web_sample\prep data_17jan2017.R
    (Note: the archived file is the original, because I corrected minor errors while archiving the data analysis)

This "prep data_17jan2017.R" script also relies on some objects that are created elsewhere.  In particular, the salafi oversample is based on data that was compiled as part of my dissertation analysis and subsequent revisions.  The object that comes into the script is called "jihaddat_inmysample.dta".  To see how to create it, use the script:
    
    [archived] E:\analysis_archive\salafi_oversample\making_jihaddat\make covariates.R
    (Note: the archived file is the original, because I corrected minor errors while archiving the data analysis)

----------
10 Regressions predicting jihadism
----------

This section contains the code for the regressions reported in Chapter 6.

This script creates Tables 6.1-6.2 and Figures 6.1-6.2.

    [archived] E:\analysis_archive\web_sample\web sample analysis_7jan2017.R
    (Note: the archived file is the original, because I corrected minor errors while archiving the data analysis)


----------
11 Analysis of fatwa page views
----------
Analysis of average page views for Islamway fatwas. This script creates Table 6.5 in the book.

    [archived] E:\analysis_archive\islamwayclicks\islamway clicks_7jan2017.R
    (Note: the archived file is the original, because I fixed minor errors while archiving)








